{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import sklearn.ensemble\n",
    "import torch\n",
    "import pandas\n",
    "import joblib\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "from lib import model, glucose_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "This notebook provides a brief walkthrough of the public code release for our KDD 2018 paper: Deep Multi-Output Forecasting: Learning to Accurately Predict Blood Glucose Trajectories. The full paper is available via arXiv: https://arxiv.org/abs/1806.05357. We hope to release our glucose data to the general public soon. In the meantime, people interested in blood glucose forecasting may be interested in the recently released OhioT1DM dataset: http://smarthealth.cs.ohio.edu/OhioT1DM-dataset.html."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data\n",
    "We have included both the processed and unprocessed dataset used to generate our results. This data was collected by authors Mamta Jaiswal, Dr. Lynn Ang, and Dr Rodica Pop-Busui."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unprocessed\n",
    "The unprocessed dataset, data/unprocessed_cgm_data.xlsx, is an excel file with one sheet per recording session (from baseline to 36 months). Each row is one individual, note that patient ids are consistent across recording sessions, and not all patients have all recording sessions. The CGM data is giving at 5 minute resolution. The unprocessed data also contain information on the daily insulin dose and delivery method, which was not used in the paper. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unprocessed = pandas.read_excel('data/unprocessed_cgm_data.xlsx', sheet_name=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unprocessed.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unprocessed['Baseline']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processed\n",
    "The processed data is stored as four pickle files (accessible via joblib), data/processed_cgm_data_{train/validation/test}.pkl and data/processed_cgm_coeffs.pkl. To process we:\n",
    "\n",
    "1. Remove data points which differ from previous ones by more than 40 mg/dL, as these measurements are almost certainly the result of sensor error\n",
    "2. Impute small data gaps using linear interpolation.\n",
    "3. Split data into contiguous chunks, splitting either on missing data or when a chunk is >101 measurements long\n",
    "4. (PolyMO) compute coefficient bins on the training data.\n",
    "\n",
    "The test set is constructed using the most recent session from each patient (approximately 10% of the data). \n",
    "\n",
    "We also include a differently processed version of the data, data/alternative_cgm_data_{train/test}, which we found useful for other projects. This data is constructed on a per-day basis, removing days with excessive missingness. Importantly, each day is linked to the ID of the patient it came from."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_tr = glucose_dat_train_rec = glucose_dataset.GlucoseDataset(data_pkl='data/processed_cgm_data_train.pkl',\n",
    "                                                                 max_pad=101,\n",
    "                                                                 output_len=6, # set 1 for Recursive, 6 for MO\n",
    "                                                                 output_dim=361,\n",
    "                                                                 polynomial=False,\n",
    "                                                                 degree=2,\n",
    "                                                                 range_low=0,\n",
    "                                                                 range_high=100,\n",
    "                                                                 coeff_file='/data2/ifox/glucose/data/training_coefficient_percentiles_ridge_alpha1_roc40.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for x, y_index, y_real, lens in data_tr:\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The polynomial fitting takes a while (several minutes), but is only required once before training\n",
    "data_tr_poly = glucose_dat_train_rec = glucose_dataset.GlucoseDataset(data_pkl='data/processed_cgm_data_train.pkl',\n",
    "                                                                      max_pad=101,\n",
    "                                                                      output_len=6,\n",
    "                                                                      output_dim=361,\n",
    "                                                                      polynomial=True,\n",
    "                                                                      degree=2,\n",
    "                                                                      range_low=0,\n",
    "                                                                      range_high=100,\n",
    "                                                                      coeff_file='/data2/ifox/glucose/data/training_coefficient_percentiles_ridge_alpha1_roc40.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for x, poly_index, y_real, lens in data_tr_poly:\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Models\n",
    "\n",
    "Our paper considers 8 classes of models:\n",
    "\n",
    "Shallow Baselines\n",
    "* Extrapolation\n",
    "* Recursive Random Forest\n",
    "* Multi-Output Random Forest\n",
    "\n",
    "Deep Baselines\n",
    "* Recursive RNN\n",
    "* Multi-Output RNN\n",
    "\n",
    "Our Approaches\n",
    "* Sequential Multi-Output RNN\n",
    "* Polynomial Multi-Output RNN\n",
    "* Polynomial Sequential Multi-Output RNN\n",
    "\n",
    "We will walk through how we implemented, trained, and evaluated each model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Shallow Baselines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extrapolation\n",
    "\n",
    "This is a simple linear extrapolation baseline implemented via Numpy. We extrapolate using the last 30 minutes (6 samples as our data was sampled at 5 minute intervals) to predict 30 minutes into the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_tr = np.cumsum(np.random.randn(1000, 16), axis=1)\n",
    "data_ts = np.cumsum(np.random.randn(100, 10), axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_input = 6\n",
    "horizon = 6\n",
    "degree = 1\n",
    "extrap_pred = []\n",
    "for i in range(len(data_ts)):\n",
    "    coeffs = np.polynomial.polynomial.polyfit(x=np.arange(n_input), y=data_ts[i][-n_input:], deg=degree)\n",
    "    extrap_pred.append(np.polyval(p=np.flip(coeffs, axis=0), x=np.arange(horizon)+n_input))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recursive and Multi-Output Random Forest\n",
    "\n",
    "Implemented using scikit-learn. Note the scikit-learn implementation automatically infers output size during the fitting step. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Recursive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_rec = sklearn.ensemble.RandomForestRegressor(n_estimators=100, n_jobs=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note, for actually training recursive models, you should use all of the data by taking input_size tiles\n",
    "X_rec_tr = data_tr[:, :10]\n",
    "y_rec_tr = data_tr[:, 10:11].ravel()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_rec.fit(X_rec_tr, y_rec_tr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# recursive prediction\n",
    "X_mod = data_ts.copy()\n",
    "p_rec_arr = []\n",
    "for i in range(6):\n",
    "    p = rf_rec.predict(X_mod)\n",
    "    p_rec_arr.append(p.reshape(-1, 1))\n",
    "    X_mod = np.concatenate((X_mod[:, 1:], p.reshape(-1, 1)), axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multi-Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note, for actually training recursive models, you should use all of the data by taking input_size tiles\n",
    "X_mo_tr = data_tr[:, :10]\n",
    "y_mo_tr = data_tr[:, 10:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_mo = sklearn.ensemble.RandomForestRegressor(n_estimators=100, n_jobs=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_mo.fit(X_mo_tr, y_mo_tr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_mo_arr = rf_mo.predict(data_ts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deep Models\n",
    "Our deep baselines are all implemented in PyTorch. They are a bit more involved to train. The basic training procedure is outlined in lib/trainer.py in the ExperimentTrainer class. The train_sup function is used to fit the provided model. The use of TensorboardX is not required, but convenient for monitoring losses. The data is assumed to be in the form of a pytorch dataset in the form of lib/glucose_dataset.py (though the specifics can vary greatly).\n",
    "\n",
    "Note that the dataset code requires precomputed polynomial coefficients for the PolyMO setting. This can be done using Numpy's polyfit function on your training data. \n",
    "\n",
    "The cuda flag should be set to True if a GPU is available."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recursive Baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rec_rnn = model.RecursiveRNN(input_dim=1, output_dim=361, hidden_size=512, depth=2,  cuda=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multi-Output Baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mo_rnn = model.MultiOutputRNN(input_dim=1, output_dim=361, output_len=6, hidden_size=512, depth=2, cuda=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sequential Multi-Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seqmo_rnn = model.MultiOutputRNN(input_dim=1, output_dim=361, output_len=6, hidden_size=512, depth=2, cuda=False, sequence=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Polynomial Multi-Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "polymo_rnn = model.MultiOutputRNN(input_dim=1, output_dim=361, output_len=6, hidden_size=512, depth=2, cuda=False, polynomial=True, degree=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Polynomial Sequential Multi-Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "polymo_rnn = model.MultiOutputRNN(input_dim=1, output_dim=361, output_len=6, hidden_size=512, depth=2, cuda=False, sequence=True, polynomial=True, degree=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
